Abstract

Reinforcement learning methods perform well in many
domains where a single agent needs to take a sequence of
actions to perform a task. These methods use sequences
of single-time-step rewards to create a policy that tries to
maximize a time-extended utility, which is a (possibly discounted)
sum of these rewards. In this paper we build on our
previous work showing how these methods can be extended
to a multi-agent environment where each agent creates its
own policy that works towards maximizing a time-extended
global utility over all agents’ actions. We show improved
methods for creating time-extended utilities for the agents
that are both “aligned” with the global utility and “learnable.”
We then show how to create single-time-step rewards
while avoiding the pitfall of having rewards aligned with the
global reward leading to utilities not aligned with the global
utility. Finally, we apply these reward functions to the
multi-agent Gridworld problem. We explicitly quantify a utility’s
learnability and alignment, and show that reinforcement
learning agents using the prescribed reward functions
successfully trade off learnability and alignment. As a result
they outperform both global (e.g., “team games”) and
local (e.g., “perfectly learnable”) reinforcement learning
solutions by as much as an order of magnitude.


1. Introduction 

There are many problems which can only be properly ad- 
dressed by having a set of autonomous agents act indepen- 
dently and have their joint sequence of actions maximize a 
pre-set global utility function. Examples of such problems 
include control of a constellation of satellites, construc- 
tion of distributed algorithms, routing over a data network, 
and control of a collection of planetary exploration vehi- 
cles (e.g., rovers on Mars, or submersibles under Europa’s 
ice caps). In such problems, agents have to solve two credit
assignment problems at once. First, an agent has to figure 
out how an action taken now affects future rewards. This 
temporal credit assignment problem has been dealt with extensively
in the single-agent context; there are many reinforcement
learning systems [10] (e.g., Q-learners [13]) that
have successfully been applied to real-world problems [1].
Second, an agent has to be able to choose actions that will, 
when combined with the actions of all other agents, lead to 
good values of the global utility. This structural credit assignment
problem is difficult and is usually handled either by having
each agent receive the global utility as its private
utility (e.g., “team” games [2]), or by imposing external
mechanisms (e.g., contracts, auctions) that encourage
the agents to work together [4, 8]. This paper addresses
the structural credit assignment problem by designing pri- 
vate utilities for the agents that are both aligned with the 
global utility, and easier for the agents to learn to maximize. 
The agents will then use Q-learners to address the temporal
credit assignment problem. 

In earlier work, we discussed how time-extended utilities 
can be applied in a multi-agent system [11]. In this work, we
extend that work by providing rewards whose undiscounted 
sums approximate those utilities, by explicitly computing 
the degree of alignedness between agent utilities and the 
global utility, and by computing the signal-to-noise proper- 
ties of the derived utilities. We also provide a new utility 
that significantly outperforms the previous ones, especially 
in domains with hundreds of agents. In Section 2, we summarize
how to construct utilities that resolve the structural
credit assignment problem. In Section 3 we discuss
how to devise rewards that deal with the structural credit 
assignment problem for a single time step, while avoiding 
a common difficulty where using rewards aligned with the 
global reward leads to utilities not aligned with the global 
utility. In Section 4, we describe token collection in the 
Gridworld problem domain and develop agents’ private utilities
that allow agents using reinforcement learning to resolve
both credit assignment problems simultaneously. In
Section 5, we present simulation results that show that the
utilities presented here possess high alignedness and learnability
as compared to traditional approaches, and lead to
solutions that significantly outperform those traditional
approaches.


2. Factored and Learnable Utilities 


In this work, we focus on multi-agent systems that aim
to maximize a global utility function, $G(z)$, which is a function
of the joint move of all agents in the system, $z$. Instead
of maximizing $G(z)$ directly, each agent, $\eta$, tries to maximize
its private utility function $g_\eta(z)$. Our goal is to create
private utility functions that will cause the multi-agent system
to produce high values of $G(z)$. Note that in many systems,
an individual agent $\eta$ will only influence some of the
components of $z$. We will use the notation $z_\eta$ to refer to the
parts of $z$ that are dependent on the actions of $\eta$. The vector
$z_\eta$ is the same size as $z$ and is equal to $z$ except that all
the components that do not depend on $\eta$ are set to zero. Note
that this subscripted vector notation is not the same as a traditional
index to a vector since $z$ and $z_\eta$ have the same number
of components.

There are two properties that are crucial to producing 
systems in which agents acting to optimize their own pri- 
vate utilities will also optimize the provided global utility. 
The first of these concerns “aligning” the private utilities of 
the agents with the global utility. Formally, a system is fully
factored when for each agent $\eta$:

$$g_\eta(z) > g_\eta(z') \Leftrightarrow G(z) > G(z') \quad \forall z, z' \ \text{s.t.}\ z - z_\eta = z' - z'_\eta .$$

Intuitively, for all pairs of states $z$ and $z'$ that differ only for
agent $\eta$, a change in $\eta$’s state that increases its private utility
cannot decrease the global utility. As a trivial example,
any system in which all the private utility functions equal
$G$ is fully factored [2]. In general though, one is more concerned
with the degree of factoredness of a given utility
function than with full factoredness. To address that concern,
we define the degree of factoredness for a given utility function
$g_\eta$ as:
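
$$F_{g_\eta} = \frac{\int_z \int_{z'} u\big[(g_\eta(z) - g_\eta(z'))\,(G(z) - G(z'))\big]\, dz'\, dz}{\int_z \int_{z'} dz'\, dz} \qquad (1)$$

where again, $z'$ denotes all states for which $z - z_\eta = z' - z'_\eta$,
and $u[x]$ is the unit step function, equal to 1 if $x > 0$.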

The second property, called learnability, measures the
dependence of a utility on the actions of a particular agent
as opposed to all the other agents. Formally we can quantify
the learnability of utility $g_\eta$, in the vicinity of $z$, as the
expected value over a new set of actions, $z'$, of $g_\eta$’s change
in magnitude caused by the change in $\eta$’s action divided by
$g_\eta$’s change in magnitude caused by the change in actions
of all the other agents:
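
$$\lambda_{\eta, g_\eta}(z) = E_{z'}\left[\frac{|g_\eta(z) - g_\eta(z - z_\eta + z'_\eta)|}{|g_\eta(z) - g_\eta(z' - z'_\eta + z_\eta)|}\right] \qquad (2)$$

where $E[\cdot]$ is the expectation operator. So at a given state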
$z$, the higher the learnability, the more $g_\eta(z)$ depends on
the move of agent $\eta$, i.e., the better the associated signal-to-noise
ratio for $\eta$. Intuitively then, higher learnability means
it is easier for $\eta$ to achieve large values of its utility. Note that
in the factored team-game example above, the utility of each
agent depended on the actions of all the agents. Such systems
often suffer from low signal-to-noise, a problem that
gets progressively worse as the size of the system grows.


2.1. Designing Agent Utilities 

Though the need for designing factored and learnable
utilities for the agents is highlighted above, in general it is
not possible for the utilities both to be factored and to have
infinite learnability (i.e., no dependence of any $g_\eta$ on any
agent other than $\eta$) for all of the agents [15]. However, consider
a family of utility functions, called Wonderful Life
(WL) utility functions. For each agent $\eta$, the WL utility is
given by:

$$WLU^{c}_{\eta}(z) \equiv G(z) - G(z - z_\eta + c) \qquad (3)$$

where $c$ is an arbitrary vector. For any choice of $c$, WL utilities
have been shown to be factored [15]. Furthermore, it can
be proven that in many circumstances, especially in large
problems, $\lambda_{\eta,WLU}(z) > \lambda_{\eta,G}(z)$, i.e., WLU has higher
learnability than does a team game [15]. This is mainly
due to the second term of the WLU, which removes part of
the effect of other agents (i.e., noise) from $\eta$’s utility (how
much noise is removed depends on the domain). Though all
WL utilities are factored regardless of the choice of $c$, the
selection of $c$ affects their learnability. Therefore, in practice
matching the proper value of $c$ to the domain greatly
improves the performance of the system [15]. This paper
will address how to handle the setting of $c$ in three separate
ways: (1) setting $c$ to the zero vector, (2) setting $c$ to the
expected value of $z_\eta$, and (3) taking the expected value of
WLU over $c$.
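
As a small illustration of the factoredness claim, the following Python sketch brute-force checks the factoredness condition on a toy problem. The toy global utility (the number of distinct non-null actions), the three-agent setup, and the names `G`, `wlu`, `selfish` and `is_factored` are all assumptions made for this example; they are not taken from the experiments in this paper.

```python
from itertools import product

ACTIONS = (1, 2, 3)  # hypothetical action set; 0 stands for the zero ("null") entry


def G(z):
    """Toy global utility (an assumption): number of distinct non-null actions in z."""
    return len({a for a in z if a != 0})


def wlu(z, eta, c=0):
    """WLU of agent eta: G(z) - G(z - z_eta + c), i.e. eta's component clamped to c."""
    clamped = z[:eta] + (c,) + z[eta + 1:]
    return G(z) - G(clamped)


def selfish(z, eta):
    """A deliberately misaligned utility for contrast: the agent prefers larger actions."""
    return z[eta]


def is_factored(g, n_agents=3):
    """Check g(z) > g(z') <=> G(z) > G(z') for every pair that differs only in one agent."""
    for eta in range(n_agents):
        for z in product(ACTIONS, repeat=n_agents):
            for a in ACTIONS:
                z2 = z[:eta] + (a,) + z[eta + 1:]
                if (g(z, eta) > g(z2, eta)) != (G(z) > G(z2)):
                    return False
    return True


for c in (0, 1, 2, 3):
    print("WLU with c =", c, "factored:", is_factored(lambda z, eta, c=c: wlu(z, eta, c)))
print("selfish utility factored:", is_factored(selfish))
```

The check passes for every clamping value because the subtracted term does not depend on the agent's own action, so the difference in WLU between two of its actions equals the difference in $G$.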

The first method is to simply set $c$ to the zero vector,
allowing the WLU to be expressed as:

$$WLU^{0}_{\eta}(z) = G(z) - G(z - z_\eta) \qquad (4)$$

In many circumstances this method is equivalent to removing
that agent from the system, hence the name of this utility
function. For such a $c$, WLU is closely related to the economics
technique of “endogenizing a player’s (agent’s) externalities”
and Vickrey tolls [12]. WLU also may appear to
have some similarities to Groves’ mechanism [3] in mechanism
design, though Groves’ mechanism actually produces
a team game by subtracting out a player’s benefit already
received from public goods.

The second method for setting $c$ is to set it to the expected
value of $z_\eta$ instead of the zero vector. In this case the
WLU can be expressed as:

$$WLU^{E}_{\eta}(z) = G(z) - G(z - z_\eta + \bar{z}_\eta) \qquad (5)$$

where $\bar{z}_\eta = E[z_\eta \mid z_{\hat{\eta}}]$ gives the expected value of
$z_\eta$ given the actions of all other agents, $z_{\hat{\eta}}$. This choice
of $c$ usually results in higher learnability than the zero vector
[15]. It is also often easy to compute and is still useful
even if the expectation can only be approximated. Note
that in addition to offering better learnability, the WLU has
an additional advantage over the global utility: in many instances,
its computation only requires partial (e.g., local) information.
Indeed, to use the global utility, either the information
required for its computation or the global utility itself
has to be broadcast, which can put a heavy communication
burden on the system, not to mention create a centralized,
single point of failure. However, WLU generally needs
much less information because many components’ impacts
cancel out (e.g., they are unchanged in both the first and second
terms of the WLU) and therefore never need to be observed
by an agent. In Section 4.2 we show this to be the
case for the Gridworld problem.

The third method for addressing how to set $c$ is to take
the expected value of the WLU over the values of $c$ that are
possible actions of $\eta$. We call this utility the Expected WL
Utility (EWU):

$$\begin{aligned}
EWU_\eta(z) &= E_{z'_\eta}\big[ WLU^{c = z'_\eta}_\eta(z) \mid z - z_\eta \big] \\
&= E_{z'_\eta}\big[ G(z) - G(z - z_\eta + z'_\eta) \mid z - z_\eta \big] \\
&= G(z) - E_{z'_\eta}\big[ G(z - z_\eta + z'_\eta) \mid z - z_\eta \big]. \qquad (6)
\end{aligned}$$

Instead of setting $c$ to a fixed value, the EWU integrates over
all values of $c$ that equal an action of $\eta$ (denoted $z'_\eta$). This
computation theoretically results in higher learnability than
the WL utility [15], though it is generally difficult to compute
and often needs to be approximated.
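
To make the relation between the EWU and the WLU concrete, here is a minimal Python sketch on a toy problem. The toy global utility, the uniform distribution over the agent's alternative actions, and the names `G`, `replace_action`, `wlu` and `ewu` are assumptions chosen for illustration only.

```python
from fractions import Fraction

ACTIONS = (1, 2, 3)  # hypothetical action set for every agent


def G(z):
    """Toy global utility (an assumption): number of distinct actions in the joint move."""
    return len(set(z))


def replace_action(z, eta, a):
    """Return z - z_eta + a: the joint move with eta's action replaced by a."""
    return z[:eta] + (a,) + z[eta + 1:]


def wlu(z, eta, c):
    """WLU with clamping action c."""
    return G(z) - G(replace_action(z, eta, c))


def ewu(z, eta, probs=None):
    """EWU: G(z) minus the expected G over eta's alternative actions.

    `probs` maps each action to its probability; uniform is assumed if omitted.
    """
    if probs is None:
        probs = {a: Fraction(1, len(ACTIONS)) for a in ACTIONS}
    expected = sum(p * G(replace_action(z, eta, a)) for a, p in probs.items())
    return G(z) - expected


z = (1, 1, 2)
print("EWU of agent 0:", ewu(z, 0))
print("mean WLU over c:", sum(Fraction(wlu(z, 0, c)) for c in ACTIONS) / len(ACTIONS))
```

Under the uniform assumption the two printed values coincide, which mirrors the first line of Equation 6: the EWU is the WLU averaged over the clamping action.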

3. Time-Extended Rewards 

Often we face the task of creating a policy to determine
a sequence of actions, which requires us to break down
a time-extended utility into single-time-step rewards. Consider
a system where the global utility, $G$, is a function of a
sequence of actions, $A$, of all the agents. ($A$ is an $\eta$ by $t$ matrix
of actions, i.e., one row per agent and one column per time
step. We will use $A_\eta$ to represent the actions of
agent $\eta$ across all time, $A_t$ to represent the actions of all
agents at time $t$, and $A_{\eta,t}$ to represent the action of agent
$\eta$ at time $t$. All such matrices have the same dimensionality
as $A$, where the non-used elements are set to zero.) Also assume
that the global utility is an undiscounted sum of global
rewards (GRs): $G(A) = \sum_t GR_t(A)$. Since the global utility
is a sum of rewards, we can attempt to maximize it by
having each agent use a Sarsa learner¹, receiving a global
reward at every time step. However, as discussed earlier,
this approach (i.e., team game) suffers from poor learnability,
particularly if there are many agents in the system.

Instead we want to use a WLU-based reward, and the direct
approach, which we call the naive WLR (NWLR), is given by:

$$NWLR_{\eta,t}(A) = GR_t(A) - GR_t(A - A_{\eta,t} + c_{\eta,t}) \qquad (7)$$

where $c_{\eta,t}$ is a fixed action for $\eta$ at time step $t$. This reward is
factored at time step $t$, since $\eta$’s action at time step $t$ cannot
affect the value of the second term. However, this reward is
not factored through time (i.e., the sum of these rewards, each
factored with GR, does not produce a utility factored with the
global utility). To illustrate this problem, consider a simple
two-time-step, multi-agent problem where each agent can
take one of two actions at each time step. For an agent $\eta$, we
can draw a reward graph showing the outcome of its actions
given the actions of the other agents. Potentially agent $\eta$
could have a different reward graph for each possible combination
of actions of all of the other agents. Consider one
of these reward graphs as illustrated in Figure 1 (top), showing
how the global reward values depend on $\eta$’s actions (this
graph actually comes from a Gridworld example problem discussed
later in Section 4, though the problem details are not
needed for the analysis here).

Figure 1 shows that if agent $\eta$ moves left on the first time
step, the global reward will be ten on the first time step and
zero on the second time step. If the agent moves right on the
first time step, the global reward will be zero for the first time
step, but it will be either ten or fifteen on the second time
step depending on whether its second action is left or right,
respectively. When the actions of all of the other agents are
held constant, if agent $\eta$ is using a Sarsa learner with the
global reward, it should form the policy of moving right for
both time steps, maximizing the global utility: the sum of
global rewards. However, consider the values produced by
the NWLR, shown in Figure 1 (bottom). If $c_{\eta,t}$ represents
the go-right action, then the NWLR will equal GR for the
first time step, but will differ from GR on the second time step
if the agent takes the go-right action on the first time step.
Instead of ten or fifteen, the NWLR will evaluate to negative
five or zero on the second time step. If the agent is using a
Sarsa learner, we would expect the agent to take the left action
at the first time step, which is sub-optimal with respect to the
global utility. This mismatch of factoredness between rewards
and utilities can be elucidated by focusing on the update rule
for a simple

¹ A Sarsa learner is used in this example for its simplicity. In the
experiments performed in the results section, Q-learners are used instead.



Figure 1. Agent $\eta$ makes a choice of two actions
for each of the two time steps. Values
on the top graph represent the global rewards.
Curved arrows show the values that
are subtracted out to form the WL reward
shown on the bottom graph. The NWLR is
factored for a single time step since the
NWLR and the GR result in the same ordering
of values for each pair of actions at each
time step. However, this form of WL reward
is not factored through time since the NWLR
and the GR give different orderings for the
sum of rewards (i.e., the resulting utilities are
not factored).


(deterministic and undiscounted) Sarsa learner:

$$Q(s_t, a_t) = R_t + Q(s_{t+1}, a_{t+1}) \qquad (8)$$

where the predicted sum of future rewards after taking action
$a_t$ in state $s_t$ is calculated by adding the immediate reward
to the predicted sum of rewards for the next action.
Using $NWLR_{\eta,t}(A)$ in our example, the Sarsa update
rule becomes:

$$Q(s_t, a_t) = GR_t(A) - GR_t(A - A_{\eta,t} + c_{\eta,t}) + Q(s_{t+1}(A), a_{t+1}) \qquad (9)$$

The problem arises in $s_{t+1}(A)$ being dependent on the current
action, $a_t$. That is not a problem in itself, but consider
what $Q(s_{t+1}(A), a_{t+1})$ is approximating:

$$\sum_{t' > t} GR_{t'}(A) - \sum_{t' > t} GR_{t'}(A - A_{\eta,t'} + c_{\eta,t'}). \qquad (10)$$

The second sum is dependent on the current action $a_t$, causing
the policy not to be factored. Even though the immediate
reward is factored, the values in the Q-tables are not, and
the agents form a policy that is not aligned with the global
utility.

Figure 2. Agent $\eta$ makes a choice of two actions
for each of the two time steps. Values
on the top graph represent the global rewards.
Curved arrows show the values that
are subtracted out to form the WL reward
shown on the bottom graph. This form of WL
reward is factored through time, as the WLR
and GR give the same ordering for the sum
of rewards resulting from a sequence of actions.

The way to address this problem is to subtract out the full
sequence of actions, $A_\eta$, instead of the single action, $A_{\eta,t}$,
in the second term of the WLR and replace it with a constant
full sequence of actions:

$$WLR_{\eta,t}(A) = GR_t(A) - GR_t(A - A_\eta + c_\eta) \qquad (11)$$

where $c_\eta$ is a full sequence of actions. This version of WLR
therefore differs from the previous version in that:

1. The subtracted single action $A_{\eta,t}$ is replaced with the
sequence of actions $A_\eta$.

2. The constant action $c_{\eta,t}$ is replaced with a constant
sequence of actions $c_\eta$.

For the example problem the values produced by this utility
are shown in Figure 2 (bottom). In this case, $c_\eta$ is set to
the “right-right” sequence of actions (i.e., 15 is subtracted
from GR). Note that now the sums of rewards for GR
and WLR have the same ordering, that is, the two utilities
are factored. With this utility an agent using a Sarsa learner
forms the correct policy. In general though, this utility may
be difficult to compute. It requires predicting the outcome of
a sequence of rewards. In Section 4 we alleviate this problem
by defining a virtual “null” action for agents (e.g.,
$c_\eta$ is set to the zero vector), representing the agent being
removed from the system.
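
The two-time-step example of Figures 1 and 2 can be replayed in a few lines of Python. This is a sketch under stated assumptions: the reward values are the ones quoted in the text (ten, zero, ten or fifteen), the other agents are held fixed, both clamping choices are "go right", and the function names (`global_reward`, `nwlr`, `wlr`, `utility`, `best_sequence`) are invented for the illustration.

```python
from itertools import product

LEFT, RIGHT = "L", "R"
SEQUENCES = list(product((LEFT, RIGHT), repeat=2))  # all two-step action sequences


def global_reward(seq, t):
    """GR_t for the example: left first gives (10, 0); right first gives (0, 10 or 15)."""
    first, second = seq
    if t == 0:
        return 10 if first == LEFT else 0
    if first == LEFT:
        return 0
    return 10 if second == LEFT else 15


def nwlr(seq, t, c=RIGHT):
    """Naive WL reward (Equation 7): only the action at time t is clamped to c."""
    clamped = seq[:t] + (c,) + seq[t + 1:]
    return global_reward(seq, t) - global_reward(clamped, t)


def wlr(seq, t, c_seq=(RIGHT, RIGHT)):
    """WL reward (Equation 11): the whole action sequence is clamped to c_seq."""
    return global_reward(seq, t) - global_reward(c_seq, t)


def utility(reward, seq):
    """Undiscounted sum of a reward over the two time steps."""
    return sum(reward(seq, t) for t in range(2))


def best_sequence(reward):
    """Sequence with the largest summed reward (first one wins ties)."""
    return max(SEQUENCES, key=lambda seq: utility(reward, seq))


for name, reward in (("GR", global_reward), ("NWLR", nwlr), ("WLR", wlr)):
    sums = {"".join(seq): utility(reward, seq) for seq in SEQUENCES}
    print(name, sums, "best:", "".join(best_sequence(reward)))
```

With these numbers the NWLR sums prefer starting with the left action, while the WLR sums differ from the GR sums by a constant and therefore preserve their ordering.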

4. Multi-agent Gridworld Problem 

The single-agent Gridworld Problem [10] is a Markov
Decision Process that is well known in the reinforcement
learning community. In this problem, an agent navigates
about a two-dimensional $n \times n$ grid by moving a distance
of one grid square in one of four directions: up, down, left
or right. The state of an agent is the grid square it is on,
and the reward an agent receives depends on the grid square
it moves to. This paper uses an episodic, finite-horizon [5]
model of the problem. In this model the agent starts at a
start state and then moves for a fixed number of time steps.
At the beginning of each episode, the agent is returned to
the start state. Reinforcement learners that maximize a sum
of rewards, such as Q-learning, can be used in this problem.

To test our multi-agent utilities, we will use a multi-agent 
version of this well known Gridworld Problem. In the multi- 
agent version there are multiple agents navigating the grid 
simultaneously, interacting with each other’s rewards. This
interaction is modeled through the use of tokens. Each token
has a value between zero and one, and each grid square can
have at most one token. When an agent moves into a grid
square, the system receives a reward for the value of the token.
The agent then removes the token so that a reward will
no longer be received if it re-enters the same square or when
another agent enters the grid square. If two agents move into
the same square at the same time, the token is picked up only once.
The global objective of the Multi-agent Gridworld Problem 
is to collect the highest aggregated value of tokens in a fixed 
number of time steps. 

We chose this problem because it is a standard problem 
in reinforcement learning research, and provides a clean 
testbed to compare the various utility functions. In the 
multi-agent version, the agent interactions provide a crit- 
ical study of coordination and interference, as the agents 
have the potential to work at cross-purposes. Each agent attempting
to maximize the value of the tokens it collects
can drive the global utility to severely sub-optimal values.
As such, the design of the private utilities is crucial in this 
problem, and we address this issue below. 

4.1. WLU and EWU for Gridworld 

In this problem, the global utility is a function of the
agents’ actions, $A$; the value of the tokens at each location,
$\Theta$ ($\Theta_{x,y}$ giving the value of the token at location $(x,y)$);
and $L_0$, the initial locations of the agents. The global utility
$G(A, \Theta, L_0)$ returns the value of the tokens received from a
sequence of actions:

$$G(A, \Theta, L_0) = \sum_{x,y} \Theta_{x,y}\, I_{x,y}(A, L_0). \qquad (12)$$

where $I_{x,y}(A, L_0)$ is an indicator function returning
one if any agent entered the location $(x, y)$ and zero
otherwise. The pseudo-code for $I_{x,y}(A, L_0)$ can be expressed
as follows:

    I_{x,y}(A, L_0):
      for each η
        s_η ← L_{0,η}
        for each t
          if s_η = (x,y) return 1
          s_η ← δ(s_η, A_{η,t})
      return 0

Since $\Theta$ and $L_0$ are always constant for a single experiment,
$\Theta$ and $L_0$ will be omitted from any function
parameter in the remainder of this paper to simplify notation
(i.e., instead of using $z = (\Theta, L_0, A)$, we will simply use
$z = A$).
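
The indicator function and Equation 12 can be sketched in Python as follows. The sketch follows the order of the pseudo-code as printed (test the current square, then move), ignores grid boundaries, and uses hypothetical names (`MOVES`, `indicator`, `global_utility`) and a hypothetical token layout; none of these details are taken from the actual implementation.

```python
# Hypothetical encoding of the four moves as (dx, dy) offsets.
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def indicator(x, y, actions, starts):
    """I_{x,y}(A, L_0): 1 if any agent occupies (x, y) at one of the tested steps."""
    for agent_actions, (sx, sy) in zip(actions, starts):
        for move in agent_actions:
            if (sx, sy) == (x, y):      # test the square first, as in the pseudo-code
                return 1
            dx, dy = MOVES[move]        # then apply the transition for this action
            sx, sy = sx + dx, sy + dy
    return 0


def global_utility(actions, theta, starts):
    """Equation 12: token values summed over the squares flagged by the indicator."""
    return sum(value * indicator(x, y, actions, starts)
               for (x, y), value in theta.items())


theta = {(0, 0): 0.5, (1, 0): 0.8, (1, 1): 0.3, (5, 5): 0.9}   # token values by square
starts = [(0, 0), (0, 0)]                                        # both agents start together
actions = [["right", "up", "up"], ["right", "right", "right"]]   # one row per agent
print(global_utility(actions, theta, starts))
```

Because the indicator only records whether a square was reached by any agent, the token at (1, 0), which both agents pass through, is counted once.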

Based on the definition and global utility given above,
the EWU (given in Equation 6) becomes:

$$EWU_\eta(A) = G(A) - \sum_{A'_\eta} p_{A'_\eta}\, G(A - A_\eta + A'_\eta) \qquad (13)$$

where the $A'_\eta$s are the possible action sequences agent $\eta$ can
take, weighted by their probabilities $p_{A'_\eta}$. The second term in
the equation is the expected value of the global utility over all
the possible sequences of actions for agent $\eta$.

Now, let us formulate the WL utilities for this domain.
First, setting $c_\eta$ to the zero matrix, we obtain the WL utility
where the agent is removed from the system²:

$$\begin{aligned}
WLU^{0}_\eta(A) &= G(A) - G(A - A_\eta + c_\eta) \\
&= G(A) - G(A - A_\eta). \qquad (14)
\end{aligned}$$

This utility returns an agent’s contribution to the global utility.
Note that this utility differs from one where the values of
the tokens present in the locations visited by the agent are
summed (i.e., a utility based on an agent’s immediate local
effects). $WLU^0$ gives the value of the tokens in locations
not visited by other agents, i.e., the values of the tokens
that would not have been picked up had agent $\eta$ not been
in the system. This provides the marginal impact for that
agent.
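
The difference between an agent's marginal contribution and the tokens it merely visits can be seen in a short Python sketch. Removing an agent is modeled here by dropping its row of actions, squares are recorded before each move, grid boundaries are ignored, and the names and token layout are hypothetical; these are assumptions for illustration rather than details of the actual experiments.

```python
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}


def squares_visited(agent_actions, start):
    """Squares occupied by one agent, recorded before each move."""
    squares, (x, y) = set(), start
    for move in agent_actions:
        squares.add((x, y))
        dx, dy = MOVES[move]
        x, y = x + dx, y + dy
    return squares


def global_utility(actions, theta, starts):
    """Total value of the tokens on squares visited by at least one agent."""
    collected = set()
    for agent_actions, start in zip(actions, starts):
        collected |= squares_visited(agent_actions, start)
    return sum(theta.get(square, 0.0) for square in collected)


def wlu0(eta, actions, theta, starts):
    """WLU^0: global utility with agent eta present minus with it removed."""
    others = [a for i, a in enumerate(actions) if i != eta]
    other_starts = [s for i, s in enumerate(starts) if i != eta]
    return (global_utility(actions, theta, starts)
            - global_utility(others, theta, other_starts))


def local_sum(eta, actions, theta, starts):
    """Value of all tokens on eta's own path, regardless of the other agents."""
    return sum(theta.get(sq, 0.0) for sq in squares_visited(actions[eta], starts[eta]))


theta = {(1, 0): 0.8, (1, 1): 0.3, (2, 0): 0.6}
starts = [(0, 0), (0, 0)]
actions = [["right", "up", "up"], ["right", "right", "right"]]
for eta in range(2):
    print(eta, "WLU0:", wlu0(eta, actions, theta, starts),
          "local sum:", local_sum(eta, actions, theta, starts))
```

The token at (1, 0) lies on both paths, so it appears in both local sums but in neither agent's $WLU^0$: it would have been collected even if either agent were removed.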


² From here onward, we will refer to $WLU^{c_\eta = 0}$ as $WLU^0$ to simplify
the notation.



Next, let us define the WL utility resulting from agent $\eta$
taking the virtual average action, where it partially takes all
possible actions³:

$$WLU^{a}_\eta(A) = G(A) - G(A - A_\eta + A^{ave}_\eta) \qquad (15)$$

where $A^{ave}_\eta$ is the average sequence of actions. Because in
practice $A^{ave}_\eta$ can be hard to define, we discuss an alternative
in the next section.


4.2. WL and EW Rewards for Gridworld 


WLU and EWU are based on the performance over a
full episode, and therefore are problematic to use directly in
practice⁴. We therefore introduce single-time-step rewards
whose undiscounted sums form these utilities. First, let us
decompose an arbitrary utility $U$ in the following manner:

$$U(A) = \sum_t \left[ U(A_{<t+1}) - U(A_{<t}) \right], \qquad (16)$$

where $A_{<t} = \sum_{t'<t} A_{t'}$ is the action matrix representing
all of the actions taken before time $t$ (the sum telescopes,
assuming the utility of the empty action matrix is zero). Now, it
is possible to represent the single-time-step reward $R_t$ by:

$$R_t(A) = U(A_{<t+1}) - U(A_{<t}) \qquad (17)$$
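
The decomposition in Equations 16 and 17 can be checked numerically with a short Python sketch. The one-dimensional token world, the token values, and the names `step_rewards` and `tokens_collected` are assumptions made for this illustration; any utility defined on action prefixes would work the same way.

```python
def step_rewards(utility, actions):
    """R_t = U(A_{<t+1}) - U(A_{<t}), where A_{<t} is the first t actions."""
    return [utility(actions[:t + 1]) - utility(actions[:t])
            for t in range(len(actions))]


TOKENS = {1: 5, 2: 3}  # hypothetical token values at positions on a line


def tokens_collected(actions, start=0):
    """Toy utility: total value of the distinct token positions reached so far."""
    position, reached = start, set()
    for step in actions:
        position += step
        reached.add(position)
    return sum(TOKENS.get(p, 0) for p in reached)


actions = [+1, +1, -1, +1]
rewards = step_rewards(tokens_collected, actions)
print(rewards, sum(rewards), tokens_collected(actions))
```

The rewards are nonzero only on the steps where a new token is reached, and their undiscounted sum recovers the utility of the full sequence because the utility of the empty prefix is zero here.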

Now we can generate the four single-time-step reward
versions of the four utilities⁵:


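$$\begin{aligned}
GR_t(A) &= G(A_{<t+1}) - G(A_{<t}) && (18) \\
EWR_{\eta,t}(A) &= GR_t(A) - \sum_{A'_\eta} p_{A'_\eta}\, GR_t(A - A_\eta + A'_\eta) && (19) \\
WLR^{0}_{\eta,t}(A) &= GR_t(A) - GR_t(A - A_\eta) && (20) \\
WLR^{a}_{\eta,t}(A) &= GR_t(A) - GR_t(A - A_\eta + A^{ave}_\eta) && (21)
\end{aligned}$$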


While we would like to use EWR and $WLR^a$ precisely
as formulated above, these rewards tend to be difficult to
compute exactly. Since the set of action sequences grows
exponentially with $t$, it is computationally expensive to sum
over all possible action sequences as needed to compute
EWR and to compute the matrix $A^{ave}_\eta$ used in $WLR^a$. Instead
we will approximate EWR and $WLR^a$, using only
the previous action instead of the entire sequence of actions:

$$\begin{aligned}
EWR_{\eta,t}(A) &= GR_t(A) - \sum_{A'_{\eta,t}} p_{A'_{\eta,t}}\, GR_t(A - A_{\eta,t} + A'_{\eta,t}) \\
WLR^{a}_{\eta,t}(A) &= GR_t(A) - GR_t(A - A_{\eta,t} + A^{ave}_{\eta,t})
\end{aligned}$$


³ To simplify notation, in what follows we refer to $WLU^{c = A^{ave}_\eta}$ (the utility
obtained by setting $c$ to the average action sequence of $\eta$) as $WLU^a$.

⁴ Providing a reward of zero in all but the final step, where the full utility
is given as a reward, is an undesirable solution because it makes filling
the Q-table an impossibly difficult task.

⁵ In the actual implementation there are some tie-breaking rules if more
than one agent goes into the same square at the same time.


where $A^{ave}_{\eta,t}$ is the average single action. This is a virtual action
defined as going in all four directions at once. Note
that these utilities are no longer factored through time since
only $\eta$’s action for a single time step is subtracted out. However,
we will show in Section 5 that in practice these utilities
are very close to being factored and are very effective.

Directly computing the utilities represented in Equations
18-21 requires knowledge about the state of the entire
system. However, to compute $WLR^0$ an agent only needs
to observe the squares it has visited. All the other
terms cancel out. In a domain such as Mars rovers this
could be done by laying a trail of sensors along the rover’s
path. The utilities EWR and $WLR^a$ have similar requirements,
except that the agents have to observe a few squares around
every square they have visited. Again in the Mars rover domain
this could be done by laying a trail of sensors that have a small
radius of observation. The only utility that truly requires full
observability is the team game utility, $G$.

5. Experimental Results 

To evaluate the effectiveness of the collective-based approach
in the Multi-agent Gridworld, we conducted experiments
in token worlds with 10, 85 and 200 agents. The
token worlds had a similar distribution of tokens, where
the “highly valued” tokens were concentrated in one corner,
with a second concentration near the center where the
rovers were initially located. For an $n \times n$ grid the value of a
token in position $(x, y)$ was $(x + y)/n - 1$ when $(x + y)/n$
was greater than 0.4. In addition the tokens at locations
$(m/2, m/2 - 1)$ and $(m/2 + 1, m/2 - 1)$ were set to 0.8.
All other tokens had a value of zero. Agents start an episode
at location $(m/2, m/2)$. To keep the approximate difficulty
of the token collection problem constant with respect to the
number of agents, the ratio of the number of grid squares
to number of agents was held approximately constant. The size
of the token world was 10x10 for 10 agents, 29x29 for 85 agents
(i.e., about 8.4 times larger than for 10 agents), and 44x44 for
200 agents (i.e., about 20 times larger than for 10 agents). In all
the experiments, the agents used Q-learners to learn their
policy (we expect a Sarsa learner to produce similar results).
Each run consisted of 1000 episodes of 10, 29 and
44 steps respectively for the 10, 85 and 200 agent systems.
There were 100 runs for each of the 10 and 85 agent experiments
(e.g., for 10 agents, we had 100 runs of 1000 episodes
of 10 time steps) and 24 runs for each 200 agent experiment.
The discount rate was 0.95 and the learning rate was
set to $1/(1 + 0.0002\, v_{s,a})$, where $v_{s,a}$ is the number of times
an agent took action $a$ in state $s$. Given the Q-values, the actions
were chosen with a Boltzmann selector with $k = 50$, and the
tables were initially set to zero, as traditionally done to trade
off exploration vs. exploitation [10]. Table 1 shows the results
of a ten rover system for the five utilities.
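
The learner settings described above can be put together in a minimal tabular Q-learner, sketched here in Python. Several details are assumptions rather than facts from the experiments: the class and method names are invented, the visit count is incremented before the learning rate is computed, terminal states get no special handling, and $k$ is treated as a multiplier on the Q-values inside the Boltzmann exponent (it could equally be a temperature that divides them).

```python
import math
import random
from collections import defaultdict

GAMMA = 0.95  # discount rate quoted in the text


class QLearner:
    def __init__(self, actions, k=50.0, seed=0):
        self.actions = list(actions)
        self.k = k                        # Boltzmann parameter (interpretation assumed)
        self.q = defaultdict(float)       # Q-table, initially all zeros
        self.visits = defaultdict(int)    # v_{s,a}: times action a was taken in state s
        self.rng = random.Random(seed)

    def learning_rate(self, state, action):
        """alpha = 1 / (1 + 0.0002 * v_{s,a})."""
        return 1.0 / (1.0 + 0.0002 * self.visits[(state, action)])

    def choose(self, state):
        """Boltzmann selection with probabilities proportional to exp(k * Q(s, a))."""
        prefs = [self.k * self.q[(state, a)] for a in self.actions]
        top = max(prefs)                  # subtract the max for numerical stability
        weights = [math.exp(p - top) for p in prefs]
        return self.rng.choices(self.actions, weights=weights)[0]

    def update(self, state, action, reward, next_state):
        """One Q-learning backup with the visit-dependent learning rate."""
        self.visits[(state, action)] += 1
        alpha = self.learning_rate(state, action)
        best_next = max(self.q[(next_state, a)] for a in self.actions)
        target = reward + GAMMA * best_next
        self.q[(state, action)] += alpha * (target - self.q[(state, action)])


learner = QLearner(["up", "down", "left", "right"])
learner.update(state=(5, 5), action="up", reward=0.8, next_state=(5, 6))
print(learner.q[((5, 5), "up")], learner.choose((5, 5)))
```

The reward passed to `update` would be whichever private reward is under test (for example the global reward or one of the WL rewards); the learner itself is unchanged across those choices.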


Table 1. Gridworld Performance for 10 Agents

Agent      Normalized      Deviation   Convergence
Utility    World Utility   in Mean     Time
WLU^a      0.998           0.001       40
WLU^0      0.993           0.002       70
EWU        0.97            0.002       110
G          0.37            0.02        770
PLU        0.29            0.02        10
(Random)   0.34            0.1


Table 2. Factoredness and Learnability Estimates for 10 Agents

Agent      Factoredness    Learnability
Utility    (in %)
WLU^a      99.0 ± 0.078    1.78 ± 0.03
WLU^0      100.0 ± 0.0     0.96 ± 0.03
EWU        99.3 ± 0.088    1.15 ± 0.03
G          100.0 ± 0.0     0.2 ± 0.0026
PLU        86.2 ± 0.66     ∞


The performance of five private utility functions was
tested: (i) the Perfectly Learnable Utility (PLU), where each
agent receives the weighted total of the tokens that it alone
collected. It is the natural extension of the single-agent problem,
and represents the optimal utility in the single-rover domain.
The PLU is a function of the moves of only a single
agent and therefore has infinite learnability, but is not generally
factored; (ii) the Team Game (TG) utility, where each
agent receives the full global utility. It is the opposite extreme
of the PLU since it is fully factored, but has very low
learnability; (iii) the $WL^0$ utility, where $c$ is set to zero. Intuitively,
this utility computes the contribution an agent makes
to the token collection, by looking at the difference in the
total token collection with and without that agent; (iv) the
$WL^a$ utility, where $c$ is set to $A^{ave}$, representing the difference
between the utility value resulting from an agent’s
actual action and its “smeared” action; and (v) the EWU,
where the agent’s contribution is computed as the difference
between the action it took and its expected action.

The performance measure in these tables is the “normalized”
global utility given by $G(A) / \sum_{x,y} \Theta_{x,y}$. This normalized
utility provides the fraction of token values that was collected
by the agents (a value of one means all available tokens
were collected). The deviation in the mean gives the
standard deviation of the normalized utility across runs divided
by $\sqrt{N}$, where $N$ is the number of runs ($N = 100$ for this
experiment). The convergence time is the time taken to reach
$0.9\,(U_{max} - U_{rand}) + U_{rand}$. The results show that PLU produced
poor results, results that were indeed worse than random
actions. This is caused by all agents aiming to acquire
the most valuable tokens, and congregating towards the cor- 
ner and center of the world where such tokens are located. 
In this case agents using the PLU competed, rather than co- 
operated with one another. Note however, that the conver- 
gence time was extremely rapid. The agents using TG fared 
marginally better, but their learning was slow. This system 
was plagued by the signal-to-noise problem associated with 
each agent receiving the full global reward for each individual
action it took. In contrast, agents using $WL^0$ and
EWU performed very well, and agents using $WL^a$ performed
almost optimally. In each of these three cases, the
reinforcement signal the agents received was both factored
and showed how their actions affected the global reward
more clearly than did the TG reinforcement signal.

We can gain some insight into these results by calculating
the learnability and factoredness of these utilities, using
Monte-Carlo methods. Factoredness is computed by taking
the action, $A_{\eta,t}$, for an agent $\eta$ at a random time $t$ and replacing
it with a random action $R^i_{\eta,t}$. If $n$ samples are taken, an
estimate for the degree of factoredness introduced in Equation 1
for a private utility $g$ is given by:

$$\frac{1}{n_d} \sum_{i=1}^{n} u\Big[ \big(g(A) - g(A - A_{\eta,t} + R^i_{\eta,t})\big)\, \big(G(A) - G(A - A_{\eta,t} + R^i_{\eta,t})\big) \Big]$$

where $u$ is the unit step function (output of 1 if its argument
is strictly greater than zero), $R^i_{\eta,t}$ is the $i$th random action
for agent $\eta$, and $n_d$ is the number of times the random action
caused a change in $G$. Similarly learnability can be approximated
as:

$$\frac{1}{n} \sum_{i=1}^{n} \frac{\| g(A) - g(A - A_{\eta,t} + R^i_{\eta,t}) \|}{\| g(A) - g(A - A_{\hat{\eta},t} + R^i_{\hat{\eta},t}) \|}$$

where $R^i_{\hat{\eta},t}$ is the $i$th random action for all agents
other than $\eta$.
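
The Monte-Carlo factoredness estimate can be sketched in Python on a toy single-step problem. The toy global utility, the two example private utilities, the fixed random seed and all function names are assumptions introduced for this illustration; the experiments in this paper sample actions in the Gridworld instead.

```python
import random

ACTIONS = (1, 2, 3)  # hypothetical action set


def G(z):
    """Toy global utility (an assumption): number of distinct actions in the joint move."""
    return len(set(z))


def wlu0(z, eta):
    """WL utility with the agent's action replaced by a null action (0)."""
    removed = z[:eta] + (0,) + z[eta + 1:]
    return G(z) - len({a for a in removed if a != 0})


def selfish(z, eta):
    """Misaligned utility for contrast: the agent simply prefers larger actions."""
    return z[eta]


def unit_step(x):
    return 1 if x > 0 else 0


def factoredness_estimate(g, z, eta, n=1000, seed=0):
    """Fraction of G-changing random replacements on which g and G move together."""
    rng = random.Random(seed)
    agree = changed = 0
    for _ in range(n):
        z2 = z[:eta] + (rng.choice(ACTIONS),) + z[eta + 1:]  # random action for eta
        d_g = g(z, eta) - g(z2, eta)
        d_G = G(z) - G(z2)
        if d_G != 0:
            changed += 1                  # n_d: samples where G actually changed
        agree += unit_step(d_g * d_G)
    return agree / changed if changed else float("nan")


z = (3, 3, 1)
print("WLU^0  :", factoredness_estimate(wlu0, z, eta=1))
print("selfish:", factoredness_estimate(selfish, z, eta=1))
```

In this toy state the WL utility agrees with $G$ on every sample that changes $G$, while the selfish utility disagrees on all of them, which is the kind of contrast the factoredness column of Table 2 is meant to capture.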

Table 2 shows the factoredness and learnability estimates,
along with the deviations in the mean for
each. The results indicate that all of the utilities except for
PLU are either perfectly factored or very close to being factored.
The improved performance of $WLU^a$ can instead be
attributed to it having higher learnability than either EWU
or $WLU^0$, a result consistent with the convergence times
reported in Table 1.

Tables 3 and 4 show the experimental results for the 
larger systems (85 and 200 agents). The results are qual- 
itatively similar to those with 10 agents, though the dif- 
ferences become more pronounced. The team game agents 
have a harder time learning, and perform randomly for the 
200 agent case. Furthermore, the performance of $WL^a$ is
now clearly superior to that of $WL^0$, showing that using the


Table 3. Gridworld Performance for 85 Agents

Agent      Normalized      Deviation
Utility    World Utility   in Mean
WLU^a      0.84            0.007
WLU^0      0.71            0.008
EWU        0.73            0.009
G          0.06            0.006
PLU        0.018           0.0006
(Random)   0.058           0.02


Table 4. Gridworld Performance for 200 Agents

Agent      Normalized      Deviation
Utility    World Utility   in Mean
WLU^a      0.57            0.01
WLU^0      0.38            0.01
EWU        0.41            0.01
G          0.025           0.005
PLU        0.007           0.0002
(Random)   0.02            0.02


degree of freedom of being able to use an arbitrary $c_\eta$ provides
significant improvements over solutions aimed at “endogenizing
externalities” ($WL^0$).

6. Discussion 

In this article we extended previous work on design- 
ing distributed reinforcement learning algorithms for multi- 
agent systems. We decomposed agent utility functions re- 
quiring sequences of actions into single step rewards that 
conserve the salient features of the agent utilities. Further- 
more, we exposed the dangers of some single step rewards 
aligned with a global reward leading to utilities which are 
not aligned with the global utility. Our analysis shows the 
simple steps required to overcome this problem. Our exper- 
imental results were conducted in a gridworld scenario, a 
problem related to many real world problems including ex- 
ploration vehicles trying to maximize aggregate scientific 
data collection (e.g., rovers on the surface of Mars). Further- 
more, we computed the factoredness and learnability for the 
different agent utility functions using Monte-Carlo meth- 
ods. These results shed light on the performance and con- 
vergence times of the different utilities and validated the as- 
sumptions made in the derivation of the utilities. 

The results demonstrate that agent utilities designed to 
have both high factoredness and high learnability outperform
both “perfectly learnable” utilities and fully factored
utilities based on “team games” (e.g., using global utility).
Even the simplest of our utilities, $WL^0$, showed marked improvement
over such utilities, while $WL^a$ showed further
improvements. The factoredness and learnability estimates
showed that WL-based utilities had high factoredness and 
high learnability, thus using the best features of both team 
games and PLUs. Our current research consists of extend- 
ing these results to domains with partial observability. 